This exercise relied on the twitter API, which is no longer available. However a new version of the academic API appears to have recently been made available again. Unsure how this will develop. We will use twitter data collected in 2020 for this exercise.
In this tutorial, you will learn how to:
The hands-on exercise for this week uses dictionary-based methods for filtering and scoring words. Dictionary-based methods use pre-generated lexicons, which are no more than list of words with associated scores or variables measuring the valence of a particular word. In this sense, the exercise is not unlike our analysis of Edinburgh Book Festival event descriptions. Here, we were filtering descriptions based on the presence or absence of a word related to women or gender. We can understand this approach as a particularly simple type of “dictionary-based” method. Here, our “dictionary” or “lexicon” contained just a few words related to gender.
Before proceeding, we’ll load the remaining packages we will need for this tutorial.
library(kableExtra)
library(tidyverse) # loads dplyr, ggplot2, and others
library(readr) # more informative and easy way to import data
library(stringr) # to handle text elements
library(tidytext) # includes set of functions useful for manipulating text
library(quanteda) # includes functions to implement Lexicoder
library(textdata)
library(academictwitteR) # for fetching Twitter data
First off: always check that you have the right working directory
getwd()
## [1] "/Users/marionlieutaud/Dropbox (Personal)/My Mac (MacBook-Pro-3.local)/Documents/GitHub/CTA-ED-exercise2"
In this exercise we’ll be using another new dataset. The data were collected from the Twitter accounts of the top eight newspapers in the UK by circulation. You can see the names of the newspapers in the code below:
# This is a code chunk to show the code that collected the data using the twitter API, back in 2020.
# You don't need to run this, and this chunk of code will be ignored when you knit to html, thanks to the 'eval=FALSE' command in the chunk option.
newspapers = c("TheSun", "DailyMailUK", "MetroUK", "DailyMirror",
"EveningStandard", "thetimes", "Telegraph", "guardian")
tweets <-
get_all_tweets(
users = newspapers,
start_tweets = "2020-01-01T00:00:00Z",
end_tweets = "2020-05-01T00:00:00Z",
data_path = "data/sentanalysis/",
n = Inf,
)
tweets <-
bind_tweets(data_path = "data/sentanalysis/", output_format = "tidy")
saveRDS(tweets, "data/sentanalysis/newstweets.rds")
You can download the tweets data directly from the source in the following way: the data was collected by Chris Barrie and is stored on his Github page.
tweets <- readRDS(gzcon(url("https://github.com/cjbarrie/CTA-ED/blob/main/data/sentanalysis/newstweets.rds?raw=true")))
Let’s have a look at the data:
head(tweets)
## # A tibble: 6 × 31
## tweet_id user_username text lang author_id source possibly_sensitive
## <chr> <chr> <chr> <chr> <chr> <chr> <lgl>
## 1 121233440226652… DailyMirror "Sec… en 16887175 Tweet… FALSE
## 2 121233416945767… DailyMirror "RT … en 16887175 Tweet… FALSE
## 3 121233319587999… thetimes "A c… en 6107422 Echob… FALSE
## 4 121233319486498… TheSun "Way… en 34655603 Echob… FALSE
## 5 121233292050719… DailyMailUK "Stu… en 111556423 Socia… FALSE
## 6 121233264057087… TheSun "Dad… en 34655603 Twitt… FALSE
## # ℹ 24 more variables: conversation_id <chr>, created_at <chr>, user_url <chr>,
## # user_location <chr>, user_protected <lgl>, user_verified <lgl>,
## # user_name <chr>, user_profile_image_url <chr>, user_description <chr>,
## # user_created_at <chr>, user_pinned_tweet_id <chr>, retweet_count <int>,
## # like_count <int>, quote_count <int>, user_tweet_count <int>,
## # user_list_count <int>, user_followers_count <int>,
## # user_following_count <int>, sourcetweet_type <chr>, sourcetweet_id <chr>, …
colnames(tweets)
## [1] "tweet_id" "user_username" "text"
## [4] "lang" "author_id" "source"
## [7] "possibly_sensitive" "conversation_id" "created_at"
## [10] "user_url" "user_location" "user_protected"
## [13] "user_verified" "user_name" "user_profile_image_url"
## [16] "user_description" "user_created_at" "user_pinned_tweet_id"
## [19] "retweet_count" "like_count" "quote_count"
## [22] "user_tweet_count" "user_list_count" "user_followers_count"
## [25] "user_following_count" "sourcetweet_type" "sourcetweet_id"
## [28] "sourcetweet_text" "sourcetweet_lang" "sourcetweet_author_id"
## [31] "in_reply_to_user_id"
Each row here is a tweets produced by one of the news outlets detailed above over a five month period, January–May 2020. Note also that each tweets has a particular date. We can therefore use these to look at any over time changes.
We won’t need all of these variables so let’s just keep those that are of interest to us:
tweets <- tweets %>%
select(user_username, text, created_at, user_name,
retweet_count, like_count, quote_count) %>%
rename(username = user_username,
newspaper = user_name,
tweet = text)
| username | tweet | created_at | newspaper | retweet_count | like_count | quote_count |
|---|---|---|---|---|---|---|
| EveningStandard | We can’t complain: Two men spend coronavirus lockdown in London pub with ‘fresh beer on tap’ 🍺 https://t.co/rG65nGWv6q | 2020-04-30T23:43:24.000Z | Evening Standard | 3 | 4 | 0 |
| EveningStandard | Best home spa treatments: face, body, nail and hair products for home https://t.co/nDZ65BbbVs | 2020-04-30T23:57:09.000Z | Evening Standard | 0 | 2 | 0 |
| guardian | Coronavirus live news: Trump claims to have evidence virus started in Wuhan lab as UK is ‘past the peak’ https://t.co/LZv4yx2kn2 | 2020-04-30T23:57:59.000Z | The Guardian | 19 | 40 | 8 |
| guardian | Rugby league gets £16m emergency loan from government https://t.co/kZ9PP4aWjO | 2020-04-30T23:57:59.000Z | The Guardian | 9 | 12 | 0 |
| guardian | Coronavirus latest: at a glance https://t.co/OrWrEdOwoU | 2020-04-30T23:58:01.000Z | The Guardian | 8 | 11 | 3 |
We manipulate the data into tidy format again, unnesting each token (here: words) from the tweet text.
tidy_tweets <- tweets %>%
mutate(desc = tolower(tweet)) %>%
unnest_tokens(word, desc) %>%
filter(str_detect(word, "[a-z]"))
We’ll then tidy this further, as in the previous example, by removing stop words:
tidy_tweets <- tidy_tweets %>%
filter(!word %in% stop_words$word)
Several sentiment dictionaries come bundled with the tidytext package. These are:
AFINN from Finn Årup Nielsen,bing from Bing Liu and collaborators, andnrc from Saif Mohammad and Peter TurneyWe can have a look at some of these to see how the relevant dictionaries are stored.
get_sentiments("afinn")
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ℹ 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ℹ 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ℹ 13,862 more rows
What do we see here. First, the AFINN lexicon gives words a score from -5 to +5, where more negative scores indicate more negative sentiment and more positive scores indicate more positive sentiment. The nrc lexicon opts for a binary classification: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust, with each word given a score of 1/0 for each of these sentiments. In other words, for the nrc lexicon, words appear multiple times if they enclose more than one such emotion (see, e.g., “abandon” above). The bing lexicon is most minimal, classifying words simply into binary “positive” or “negative” categories.
Let’s see how we might filter the texts by selecting a dictionary, or subset of a dictionary, and using inner_join() to then filter out tweet data. We might, for example, be interested in fear words. Maybe, we might hypothesize, there is a uptick of fear toward the beginning of the coronavirus outbreak. First, let’s have a look at the words in our tweet data that the nrc lexicon codes as fear-related words.
nrc_fear <- get_sentiments("nrc") %>%
filter(sentiment == "fear")
tidy_tweets %>%
inner_join(nrc_fear) %>%
count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 1,173 × 2
## word n
## <chr> <int>
## 1 mum 4509
## 2 death 4073
## 3 police 3275
## 4 hospital 2240
## 5 government 2179
## 6 pandemic 1877
## 7 fight 1309
## 8 die 1199
## 9 attack 1099
## 10 murder 1064
## # ℹ 1,163 more rows
We have a total of 1,174 words with some fear valence in our tweet data according to the nrc classification. Several seem reasonable (e.g., “death,” “pandemic”); others seems less so (e.g., “mum,” “fight”).
Do we see any time trends? First let’s make sure the data are properly arranged in ascending order by date. We’ll then add column, which we’ll call “order,” the use of which will become clear when we do the sentiment analysis.
#gen data variable, order and format date
tidy_tweets$date <- as.Date(tidy_tweets$created_at)
tidy_tweets <- tidy_tweets %>%
arrange(date)
tidy_tweets$order <- 1:nrow(tidy_tweets)
Remember that the structure of our tweet data is in a one token (word) per document (tweet) format. In order to look at sentiment trends over time, we’ll need to decide over how many words to estimate the sentiment.
In the below, we first add in our sentiment dictionary with inner_join(). We then use the count() function, specifying that we want to count over dates, and that words should be indexed in order (i.e., by row number) over every 1000 rows (i.e., every 1000 words).
This means that if one date has many tweets totalling >1000 words, then we will have multiple observations for that given date; if there are only one or two tweets then we might have just one row and associated sentiment score for that date.
We then calculate the sentiment scores for each of our sentiment types (positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) and use the spread() function to convert these into separate columns (rather than rows). Finally we calculate a net sentiment score by subtracting the score for negative sentiment from positive sentiment.
#get tweet sentiment by date
tweets_nrc_sentiment <- tidy_tweets %>%
inner_join(get_sentiments("nrc")) %>%
count(date, index = order %/% 1000, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 2 of `x` matches multiple rows in `y`.
## ℹ Row 7711 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
tweets_nrc_sentiment %>%
ggplot(aes(date, sentiment)) +
geom_point(alpha=0.5) +
geom_smooth(method= loess, alpha=0.25)
## `geom_smooth()` using formula = 'y ~ x'
How do our different sentiment dictionaries look when compared to each other? We can then plot the sentiment scores over time for each of our sentiment dictionaries like so:
tidy_tweets %>%
inner_join(get_sentiments("bing")) %>%
count(date, index = order %/% 1000, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
ggplot(aes(date, sentiment)) +
geom_point(alpha=0.5) +
geom_smooth(method= loess, alpha=0.25) +
ylab("bing sentiment")
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 54114 of `x` matches multiple rows in `y`.
## ℹ Row 3848 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
## `geom_smooth()` using formula = 'y ~ x'
tidy_tweets %>%
inner_join(get_sentiments("nrc")) %>%
count(date, index = order %/% 1000, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
ggplot(aes(date, sentiment)) +
geom_point(alpha=0.5) +
geom_smooth(method= loess, alpha=0.25) +
ylab("nrc sentiment")
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 2 of `x` matches multiple rows in `y`.
## ℹ Row 7711 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
## `geom_smooth()` using formula = 'y ~ x'
tidy_tweets %>%
inner_join(get_sentiments("afinn")) %>%
group_by(date, index = order %/% 1000) %>%
summarise(sentiment = sum(value)) %>%
ggplot(aes(date, sentiment)) +
geom_point(alpha=0.5) +
geom_smooth(method= loess, alpha=0.25) +
ylab("afinn sentiment")
## Joining with `by = join_by(word)`
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
## `geom_smooth()` using formula = 'y ~ x'
We see that they do look pretty similar… and interestingly it seems that overall sentiment positivity increases as the pandemic breaks.
Of course, list- or dictionary-based methods need not only focus on sentiment, even if this is one of their most common uses. In essence, what you’ll have seen from the above is that sentiment analysis techniques rely on a given lexicon and score words appropriately. And there is nothing stopping us from making our own dictionaries, whether they measure sentiment or not. In the data above, we might be interested, for example, in the prevalence of mortality-related words in the news. As such, we might choose to make our own dictionary of terms. What would this look like?
A very minimal example would choose, for example, words like “death” and its synonyms and score these all as 1. We would then combine these into a dictionary, which we’ve called “mordict” here.
word <- c('death', 'illness', 'hospital', 'life', 'health',
'fatality', 'morbidity', 'deadly', 'dead', 'victim')
value <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
mordict <- data.frame(word, value)
mordict
## word value
## 1 death 1
## 2 illness 1
## 3 hospital 1
## 4 life 1
## 5 health 1
## 6 fatality 1
## 7 morbidity 1
## 8 deadly 1
## 9 dead 1
## 10 victim 1
We could then use the same technique as above to bind these with our data and look at the incidence of such words over time. Combining the sequence of scripts from above we would do the following:
tidy_tweets %>%
inner_join(mordict) %>%
group_by(date, index = order %/% 1000) %>%
summarise(morwords = sum(value)) %>%
ggplot(aes(date, morwords)) +
geom_bar(stat= "identity") +
ylab("mortality words")
## Joining with `by = join_by(word)`
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
The above simply counts the number of mortality words over time. This might be misleading if there are, for example, more or longer tweets at certain points in time; i.e., if the length or quantity of text is not time-constant.
Why would this matter? Well, in the above it could just be that we have more mortality words later on because there are just more tweets earlier on. By just counting words, we are not taking into account the denominator.
An alternative, and preferable, method here would simply take a character string of the relevant words. We would then sum the total number of words across all tweets over time. Then we would filter our tweet words by whether or not they are a mortality word or not, according to the dictionary of words we have constructed. We would then do the same again with these words, summing the number of times they appear for each date.
After this, we join with our data frame of total words for each date. Note that here we are using full_join() as we want to include dates that appear in the “totals” data frame that do not appear when we filter for mortality words; i.e., days when mortality words are equal to 0. We then go about plotting as before.
mordict <- c('death', 'illness', 'hospital', 'life', 'health',
'fatality', 'morbidity', 'deadly', 'dead', 'victim')
#get total tweets per day (no missing dates so no date completion required)
totals <- tidy_tweets %>%
mutate(obs=1) %>%
group_by(date) %>%
summarise(sum_words = sum(obs))
#plot
tidy_tweets %>%
mutate(obs=1) %>%
filter(grepl(paste0(mordict, collapse = "|"),word, ignore.case = T)) %>%
group_by(date) %>%
summarise(sum_mwords = sum(obs)) %>%
full_join(totals, word, by="date") %>%
mutate(sum_mwords= ifelse(is.na(sum_mwords), 0, sum_mwords),
pctmwords = sum_mwords/sum_words) %>%
ggplot(aes(date, pctmwords)) +
geom_point(alpha=0.5) +
geom_smooth(method= loess, alpha=0.25) +
xlab("Date") + ylab("% mortality words")
## `geom_smooth()` using formula = 'y ~ x'
The above approaches use general dictionary-based techniques that were not designed for domain-specific text such as news text. The Lexicoder Sentiment Dictionary, by @young_affective_2012 was designed specifically for examining the affective content of news text. In what follows, we will see how to implement an analysis using this dictionary.
We will conduct the analysis using the quanteda package. You will see that we can tokenize text in a similar way using functions included in the quanteda package.
With the quanteda package we first need to create a “corpus” object, by declaring our tweets a corpus object. Here, we make sure our date column is correctly stored and then create the corpus object with the corpus() function. Note that we are specifying the text_field as “tweet” as this is where our text data of interest is, and we are including information on the date that tweet was published. This information is specified with the docvars argument. You’ll see then that the corpus consists of the text and so-called “docvars,” which are just the variables (columns) in the original dataset. Here, we have only included the date column.
tweets$date <- as.Date(tweets$created_at)
tweet_corpus <- corpus(tweets, text_field = "tweet", docvars = "date")
## Warning: docvars argument is not used.
We then tokenize our text using the tokens() function from quanteda, removing punctuation along the way:
toks_news <- tokens(tweet_corpus, remove_punct = TRUE)
We then take the data_dictionary_LSD2015 that comes bundled with quanteda and and we select only the positive and negative categories, excluding words deemed “neutral.” After this, we are ready to “look up” in this dictionary how the tokens in our corpus are scored with the tokens_lookup() function.
# select only the "negative" and "positive" categories
data_dictionary_LSD2015_pos_neg <- data_dictionary_LSD2015[1:2]
toks_news_lsd <- tokens_lookup(toks_news, dictionary = data_dictionary_LSD2015_pos_neg)
This creates a long list of all the texts (tweets) annotated with a series of ‘positive’ or ‘negative’ annotations depending on the valence of the words in that text. The creators of quanteda then recommend we generate a document feature matric from this. Grouping by date, we then get a dfm object, which is a quite convoluted list object that we can plot using base graphics functions for plotting matrices.
# create a document document-feature matrix and group it by date
dfmat_news_lsd <- dfm(toks_news_lsd) %>%
dfm_group(groups = date)
# plot positive and negative valence over time
matplot(dfmat_news_lsd$date, dfmat_news_lsd, type = "l", lty = 1, col = 1:2,
ylab = "Frequency", xlab = "")
grid()
legend("topleft", col = 1:2, legend = colnames(dfmat_news_lsd), lty = 1, bg = "white")
# plot overall sentiment (positive - negative) over time
plot(dfmat_news_lsd$date, dfmat_news_lsd[,"positive"] - dfmat_news_lsd[,"negative"],
type = "l", ylab = "Sentiment", xlab = "")
grid()
abline(h = 0, lty = 2)
Alternatively, we can recreate this in tidy format as follows:
negative <- dfmat_news_lsd@x[1:121]
positive <- dfmat_news_lsd@x[122:242]
date <- dfmat_news_lsd@Dimnames$docs
tidy_sent <- as.data.frame(cbind(negative, positive, date))
tidy_sent$negative <- as.numeric(tidy_sent$negative)
tidy_sent$positive <- as.numeric(tidy_sent$positive)
tidy_sent$sentiment <- tidy_sent$positive - tidy_sent$negative
tidy_sent$date <- as.Date(tidy_sent$date)
And plot accordingly:
tidy_sent %>%
ggplot() +
geom_line(aes(date, sentiment))
First the subsetting and some preprocessing
# to subset means to take only a sample according to a specific condition. Here I'm going to look only at tabloid media.
tweets_tabloid <- tweets %>%
filter(newspaper %in% c("The Mirror", "The Sun", "Daily Mail U.K.", "Metro"))
# create corpus
tweets_tabloid_corpus <- corpus(tweets_tabloid, text_field = "tweet")
# check that all docvars have been correctly recognised
names(docvars(tweets_tabloid_corpus))
## [1] "username" "created_at" "newspaper" "retweet_count"
## [5] "like_count" "quote_count" "date"
# tokenising and tidying
toks_tabloid <- tokens(tweets_tabloid_corpus,
remove_punct = TRUE, # remove punctuation
remove_url = TRUE, # remove urls
remove_numbers = TRUE, # remove numbers
remove_symbols = TRUE) %>% # remove symbols
tokens_select(pattern = stopwords("en"), selection = "remove") %>% # remove stopwords
tokens_tolower()
We will need a denominator for the frequency of our sentiment words, so we need to calculate a total for each tabloid. We could could calculate those totals based on different stages it on different degrees of preprocessing. Here I want to calculate the total tokens after preprocessing (i.e. without punctuation, stopwords or urls), so I base the total on the tokens object we just created (toks_tabloid)
# now calculate total tokens for each newspaper
total_dfm_tabloid <- dfm(toks_tabloid) %>%
dfm_group(groups = newspaper) %>% # group the dfm by newspaper
convert(to = "data.frame") %>% # convert to data frame so it's easier to manipulate
group_by(doc_id) %>% # group the data frame by newspaper for the calculation
reframe(total = rowSums(across(everything()))) # calculate total for each row (total tokens)
# have a look at the first rows to check all looks good
head(total_dfm_tabloid)
## # A tibble: 4 × 2
## doc_id total
## <chr> <dbl>
## 1 Daily Mail U.K. 191408
## 2 Metro 65434
## 3 The Mirror 463809
## 4 The Sun 291059
Now we move to the sentiment analysis. I use the NRC dictionary but I prefer to use quanteda so I first reformat it into a quanteda dictionary object, so I can then refer to it within the quanteda command ‘token_lookup()’
# turn the NRC dictionary into a quanteda dictionary
data_dictionary_NRC <- get_sentiments("nrc")
data_dictionary_NRC <- as.dictionary(data_dictionary_NRC)
#get tweet sentiments by newspaper
toks_tabloid_nrc <- toks_tabloid %>%
tokens_lookup(dictionary = data_dictionary_NRC)
# turn into document feature matrix (dfm)
dfm_tabloid_nrc <- dfm(toks_tabloid_nrc) %>%
dfm_group(groups = newspaper) %>%
convert(to = "data.frame") # convert to data frame
# join with the totals by newspaper
dfm_tabloid_nrc <- dfm_tabloid_nrc %>%
full_join(total_dfm_tabloid, by="doc_id") %>%
rename("newspaper" = doc_id) # rename the 'doc_id' column to 'newspaper'
# let's have a look at the numbers by newspaper and by sentiment
kable(dfm_tabloid_nrc)
| newspaper | anger | anticipation | disgust | fear | joy | negative | positive | sadness | surprise | trust | total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Daily Mail U.K. | 8167 | 8382 | 4303 | 11326 | 7215 | 16355 | 16528 | 8830 | 4476 | 10874 | 191408 |
| Metro | 1833 | 2861 | 1213 | 2935 | 2225 | 4090 | 5132 | 2200 | 1501 | 3307 | 65434 |
| The Mirror | 20732 | 19903 | 13258 | 33365 | 16807 | 44747 | 37089 | 23482 | 12292 | 23589 | 463809 |
| The Sun | 9879 | 13081 | 6383 | 15732 | 13235 | 22553 | 25230 | 10131 | 7739 | 13970 | 291059 |
Now we can calculate relative frequencies by sentiment and format for plotting. Using mutate() with across() allows you to modify multiple columns at once.
dfm_tabloid_nrc_pct <- dfm_tabloid_nrc %>%
mutate(across(c("anger":"trust"), ~round((.x/total)*100, digits=1)))
# here the instruction is that each row value (x) in columns from "anger" to "trust" should be divided by the row total, and then rounded.
# you could also just do it with simpler code, one column at at time, with more lines of code. It would look like this:
# mutate(anger = round((anger/total)*100, digits=1),
# trust = round((trust/total)*100, digits=1)) # etc...
# we pivot the data to 'long' format to make it easier to plot
dfm_tabloid_nrc_pct <- dfm_tabloid_nrc_pct %>%
select(-total) %>% # remove the 'total' column
pivot_longer(c(anger:trust), names_to = "sentiment", values_to = "frequency")
# plot by newspaper
dfm_tabloid_nrc_pct %>%
ggplot() + # when we enter ggplot environment we need to use '+' not '%>%',
geom_col(aes(x=newspaper, y=frequency, group=sentiment, fill=newspaper)) + # reordering newspaper variable so it is displayed from most negative to most positive
coord_flip() + # pivot plot by 90 degrees
facet_wrap(~sentiment, nrow = 2) + # create multiple plots for each
ylab("Sentiment relative frequency") + # label y axis
scale_fill_manual(values = c("blue", "darkblue", "red", "pink")) + # pick the colours
guides(fill = "none") + # no need to show legend for colour
theme_minimal() # pretty graphic theme
Tabloids don’t differ much on trust, surprise or anticipation; relative to the others,the Mirror and the Daily Mail use more words associated with anger, fear and sadness, while the Sun uses slightly more joyful words. Overall, the Mirror and the Daily Mail are more negative and use a variety of negative sentiments (fear, anger etc) more extensively than the other two. Metro seems less likely to invoke any sentiment-connotated word, as it shows lower frequency in almost all sentiment categories. This may be a sign that the writing in Metro is somewhat less sensationalist.
# first we do this using only the tidyverse
trans_words <- c('trans', 'transgender', 'trans rights', 'trans rights activists', 'transphobic', 'terf', 'terfs', 'transphobia', 'transphobes', 'gender critical', 'LGBTQ', 'LGBTQ+')
#get total tweets per day (no missing dates so no date completion required)
totals_newspaper <- tidy_tweets %>%
mutate(obs=1) %>%
group_by(newspaper) %>%
summarise(sum_words = sum(obs))
#plot
tidy_tweets %>%
mutate(obs=1) %>%
filter(grepl(paste0(trans_words, collapse = "|"), word, ignore.case = T)) %>%
group_by(newspaper) %>%
summarise(sum_mwords = sum(obs)) %>%
full_join(totals_newspaper, word, by="newspaper") %>%
mutate(sum_mwords= ifelse(is.na(sum_mwords), 0, sum_mwords),
pcttranswords = sum_mwords/sum_words) %>%
ggplot(aes(x=reorder(newspaper, -pcttranswords), y=pcttranswords)) +
geom_point() +
xlab("newspaper") + ylab("% words referring to trans or terfs") +
coord_flip() +
theme_minimal()
The Sun looks like it discusses trans people and trans rights (or transphobia) particularly often.
# we go back to the initial corpus
toks_news <- tokens(tweet_corpus,
remove_punct = TRUE,
remove_url = TRUE,
remove_numbers = TRUE,
remove_symbols = TRUE) %>%
tokens_select(pattern = stopwords("en"), selection = "remove")
toks_news_lsd <- tokens_lookup(toks_news,
dictionary = data_dictionary_LSD2015_pos_neg)
# recreate a document-feature matrix but instead of grouping it by date, we group it by 'username' (aka newspapers)
dfm_news_lsd <- dfm(toks_news_lsd) %>%
dfm_group(groups = username)
# convert it to a dataframe so it's easier to use
tidy_dfm_news_lsd <- dfm_news_lsd %>%
convert(to = "data.frame") %>%
rename("newspaper" = doc_id) %>% # when converting to data.frame, R called our grouping variable 'doc_id'. We rename it 'newspaper' instead.
mutate(sentiment = positive - negative) # create variable for overall sentiment
# plot by newspaper
tidy_dfm_news_lsd %>%
ggplot() + # when we enter ggplot environment we need to use '+' not '%>%',
geom_point(aes(x=reorder(newspaper, -sentiment), y=sentiment)) + # reordering newspaper variable so it is displayed from most negative to most positive
coord_flip() + # pivot plot by 90 degrees
xlab("Newspapers") + # label x axis
ylab("Overall tweet sentiment (negative to positive)") + # label y axis
theme_minimal() # pretty graphic theme
Difficult to interpret… Tabloids (The Daily Mirror, the Sun and the Daily Mail) seems to write overall more negative tweets than more traditional newspapers. This is especially true for The Daily Mirror. Overall it may be interesting to note that the more left-leaning papers (the Daily Mirror and the Guardian) also appear the most negative within their respective genre (tabloids and non-tabloid newspapers).
Because many of you wanted to analyse sentiment not just by newspaper but by newspaper and date, I include code to do this.
# recreate a document-feature matrix but instead of grouping it just by date or just by newspaper, we group it by both (we interact the two)
dfm_news_lsd <- dfm(toks_news_lsd) %>%
dfm_group(groups = interaction(username, date)) # we group by interaction variable between newspaper and date
# convert it to a dataframe so it's easier to use
tidy_dfm_news_lsd <- dfm_news_lsd %>%
convert(to = "data.frame")
# the interaction has batched together newspaper name and date (e.g. DailyMailUK.2020-01-01).
# We want to separate them into two distinct variables. We can do it using the command extract() and regex. It's easy because the separation is always a .
tidy_dfm_news_lsd <- tidy_dfm_news_lsd %>%
extract(doc_id, into = c("newspaper", "date"), regex = "([a-zA-Z]+)\\.(.+)")
# nice! now we again have two distinct clean variables called 'newspaper' and 'date'.
# arrange by date
tidy_dfm_news_lsd <- tidy_dfm_news_lsd %>%
mutate(date = as.Date(date)) %>% # clarify to R this is a date
arrange(date)
# recreate variable for overall sentiment
tidy_dfm_news_lsd <- tidy_dfm_news_lsd %>%
mutate(sentiment = positive - negative)
# plot
tidy_dfm_news_lsd %>%
ggplot(aes(x=date, y=sentiment)) +
geom_point(alpha=0.5) + # plot points
geom_smooth(method= loess, alpha=0.25) + # plot smooth line
facet_wrap(~newspaper, nrow = 2) + # 'faceting' means multiplying the plots so that there is one plot for each member of the group (here, sentiment) that way you can easily compare trend across group.
xlab("date") + ylab("overall sentiment (negative to positive)") +
ggtitle("Tweet sentiment trend across 8 British newspapers") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The Mirror is clealry all more negative overall, but also more dispersed whereas the Times, the Telegraph, the Guardian and Metro all show compact and stable sentiment over time. The increase in positive words use that we saw in the overall analysis seems to have been driven chiefly by the Mirror and the Sun.